This dataset contains audio statistics of the top 2000 tracks on Spotify from 1998-2020. The dataset was retrieved from Kaggle. The data includes about 18 columns each describing the track and its qualities. We are particularly interested in the genre of the track, valence, tempo, loudness, mode, key, energy, danceability, year, song title, and artist.
The columns in the dataset are defined as follows:
artist: Name of the Artist.
song: Name of the Track.
duration_ms: Duration of the track in milliseconds.
explicit: The lyrics or content of a song or a music video contain one or more of the criteria which could be considered offensive or unsuitable for children.
year: Release Year of the track.
popularity: The higher the value the more popular the song is.
danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.
key: The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
genre: Genre of the track.
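As a small illustration of the key encoding described above, the sketch below maps the integer codes to pitch names. The helper `key_name()` and the vector `pitch_classes` are hypothetical names introduced here, not part of the dataset or the original analysis.

```r
# Pitch Class notation: 0 = C, 1 = C#/Db, ..., 11 = B
pitch_classes <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                   "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

# Hypothetical helper: returns NA for -1 (no key detected) or any other
# value outside 0..11, because match() finds no position for it.
key_name <- function(key) {
  pitch_classes[match(key, 0:11)]
}

key_name(c(0, 1, 10, -1))
```

Using `match()` rather than indexing with `key + 1` avoids silently dropping the -1 entries.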
What genres were included in the top hits of songs from 1998-2020?
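library(stringr)  # str_split()
library(plotly)   # plot_ly(), add_trace(), layout()
# A song can list several comma-separated genres; split them so the song counts once per genre
SpotifyTopHits_genre <- trimws(unlist(str_split(SpotifyTopHits$genre, ",")))
y <- data.frame(SpotifyTopHits_genre)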
p <- plot_ly(x = table(y)["blues"], type="bar", name = 'Blues') %>%
layout(
title = "Song Genres of Spotify Top Hits",
xaxis = list(title = "Song Count"),
yaxis = list(title = "Genres")
)
p <- add_trace(p, x = ~table(y)["classical"], name = 'Classical')
p <- add_trace(p, x = ~table(y)["country"], name = 'Country')
p <- add_trace(p, x = ~table(y)["Dance/Electronic"], name = 'Dance/Electronic')
p <- add_trace(p, x = ~table(y)["Folk/Acoustic"], name = 'Folk/Acoustic')
p <- add_trace(p, x = ~table(y)["hip hop"], name = 'hip hop')
p <- add_trace(p, x = ~table(y)["jazz"], name = 'jazz')
p <- add_trace(p, x = ~table(y)["latin"], name = 'latin')
p <- add_trace(p, x = ~table(y)["metal"], name = 'metal')
p <- add_trace(p, x = ~table(y)["pop"], name = 'pop')
p <- add_trace(p, x = ~table(y)["R&B"], name = 'R&B')
p <- add_trace(p, x = ~table(y)["rock"], name = 'rock')
p <- add_trace(p, x = ~table(y)["set"], name = 'set')
p <- add_trace(p, x = ~table(y)["World/Traditional"], name = 'World/Traditional')
p
Pop music dominates all genres in the Top Spotify Hits data set. This is rather unsurprising since pop music has catchy rhythms that make us want to sing along and dance. The lyrics usually repeat themselves, which makes them easy to remember. Pop music also typically revolves around the same themes and topics, which makes it easy to enjoy. We will take a closer look and isolate pop music to see if there is a strong correlation between the attributes.
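The genre counts behind the chart can also be tabulated in one step. The sketch below uses base R only and a made-up `genre_column` vector standing in for `SpotifyTopHits$genre`; the values are illustrative, not taken from the dataset.

```r
# Hypothetical stand-in for SpotifyTopHits$genre (comma-separated genre labels)
genre_column <- c("pop", "hip hop, pop", "rock, pop", "hip hop, pop, R&B")

# Split on commas, trim whitespace, then count each genre and sort descending
genres <- trimws(unlist(strsplit(genre_column, ",")))
genre_counts <- sort(table(genres), decreasing = TRUE)

print(genre_counts)
```

A sorted table like this could feed a single bar trace instead of one trace per genre, which also avoids hard-coding the genre names.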
The largest number of songs are in the C major key, followed by A#/Bb, and then songs that are not categorized in a particular key. Close behind the N/A group are songs in the F#/Gb key. Since 1998, the fewest Spotify Top Hit songs have been in the D key. Interestingly, D major is traditionally described as the key of triumph, war cries, and victories, so it would make sense that songs about these themes are listened to less.
Of the 2,000 songs in the dataset, 1,449 were not explicit and 551 were labeled explicit. Explicitness could play a role in which songs reach the Spotify Top Hits, since this data set is not limited to a certain age group.
Based on the aggregated data from 1998-2020, the most common tempo was 128 BPM.
It appears that tempo (BPM) increased steadily from 1998 to 2020. This is consistent with our previous finding that most songs fall in the 128 BPM range.
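One way to check a trend like this is to average tempo per year and fit a straight line through the yearly means. The sketch below uses a tiny made-up data frame `toy`; its numbers are illustrative only and do not come from the dataset.

```r
# Hypothetical data: two songs per year with made-up tempos
toy <- data.frame(
  year  = rep(1998:2000, each = 2),
  tempo = c(100, 110, 120, 120, 125, 135)
)

# Mean tempo for each year
tempo_by_year <- aggregate(tempo ~ year, data = toy, FUN = mean)

# Slope of a linear fit through the yearly means (BPM change per year)
trend <- coef(lm(tempo ~ year, data = tempo_by_year))[["year"]]

print(tempo_by_year)
print(trend)
```

A positive slope indicates rising tempo over time; on real data the size of the slope relative to its standard error matters more than its sign alone.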
As mentioned above, can we see any correlations between the attributes of a song?
## Top 3 for attributes: duration_ms
## Top Correlated Attributes: explicitness speechiness popularity
## Top Correlation Values: 0.12 0.07 0.05
##
## Top 3 for attributes: year
## Top Correlated Attributes: tempo explicitness danceability
## Top Correlation Values: 0.08 0.08 0.03
##
## Top 3 for attributes: popularity
## Top Correlated Attributes: duration_ms explicitness loudness
## Top Correlation Values: 0.05 0.05 0.03
##
## Top 3 for attributes: danceability
## Top Correlated Attributes: valence explicitness speechiness
## Top Correlation Values: 0.4 0.25 0.15
##
## Top 3 for attributes: energy
## Top Correlated Attributes: loudness valence liveness
## Top Correlation Values: 0.65 0.33 0.16
##
## Top 3 for attributes: key
## Top Correlated Attributes: valence danceability year
## Top Correlation Values: 0.04 0.03 0.01
##
## Top 3 for attributes: loudness
## Top Correlated Attributes: energy valence liveness
## Top Correlation Values: 0.65 0.23 0.1
##
## Top 3 for attributes: mode
## Top Correlated Attributes: tempo explicitness liveness
## Top Correlation Values: 0.05 0.05 0.03
##
## Top 3 for attributes: speechiness
## Top Correlated Attributes: explicitness danceability duration_ms
## Top Correlation Values: 0.42 0.15 0.07
##
## Top 3 for attributes: acousticness
## Top Correlated Attributes: year popularity duration_ms
## Top Correlation Values: 0.03 0.02 0.01
##
## Top 3 for attributes: instrumentalness
## Top Correlated Attributes: energy tempo danceability
## Top Correlation Values: 0.04 0.03 0.02
##
## Top 3 for attributes: liveness
## Top Correlated Attributes: energy loudness speechiness
## Top Correlation Values: 0.16 0.1 0.06
##
## Top 3 for attributes: valence
## Top Correlated Attributes: danceability energy loudness
## Top Correlation Values: 0.4 0.33 0.23
##
## Top 3 for attributes: tempo
## Top Correlated Attributes: energy year loudness
## Top Correlation Values: 0.15 0.08 0.08
##
## Top 3 for attributes: explicitness
## Top Correlated Attributes: speechiness danceability duration_ms
## Top Correlation Values: 0.42 0.25 0.12
In the three figures above, songs were analyzed to see if there is any correlation between the attributes that were calculated. Interestingly enough, there are a few correlations to note that could provide some insight into the data.
Loudness vs. Energy - loudness is measured in decibels, while energy represents a perceptual measure of intensity and activity. The two attributes have a fairly strong positive correlation of +0.65.
Acousticness vs. Energy - the two attributes have a weak negative correlation of -0.45.
Danceability vs. Valence - the two attributes have a weak positive correlation of +0.40.
Explicitness vs. Speechiness - the two attributes have a weak positive correlation of +0.42.
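A listing in the "Top 3" format shown earlier could be produced along the following lines. This is a sketch under assumptions: the function name `top_correlated()` and the simulated data frame `toy` are hypothetical, and ranking by signed correlation (so negative correlations never reach the top 3) is inferred from the output rather than confirmed by the original code.

```r
set.seed(42)
n <- 200
energy <- runif(n)

# Simulated attributes: loudness and valence are built to depend on energy
toy <- data.frame(
  energy   = energy,
  loudness = -10 + 6 * energy + rnorm(n),
  valence  = 0.5 * energy + runif(n),
  tempo    = rnorm(n, mean = 120, sd = 25)
)

# For one attribute, return the k other attributes with the largest
# (signed) Pearson correlation, rounded to two decimals.
top_correlated <- function(df, attribute, k = 3) {
  cors <- cor(df)[attribute, ]
  cors <- cors[names(cors) != attribute]
  round(head(sort(cors, decreasing = TRUE), k), 2)
}

for (a in names(toy)) {
  top <- top_correlated(toy, a)
  cat("Top 3 for attributes:", a, "\n")
  cat("Top Correlated Attributes:", names(top), "\n")
  cat("Top Correlation Values:", top, "\n\n")
}
```

Sorting on `abs(cors)` instead would surface strong negative relationships, such as acousticness vs. energy, in the same listing.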
## Pop Top 3 for attributes: duration_ms
## Pop Top Correlated Attributes: liveness acousticness explicitness
## Pop Top Correlation Values: 0.08 0.07 0.07
##
## Pop Top 3 for attributes: year
## Pop Top Correlated Attributes: speechiness explicitness loudness
## Pop Top Correlation Values: 0.2 0.15 0.04
##
## Pop Top 3 for attributes: popularity
## Pop Top Correlated Attributes: acousticness duration_ms tempo
## Pop Top Correlation Values: 0.04 0.03 0.03
##
## Pop Top 3 for attributes: danceability
## Pop Top Correlated Attributes: valence energy instrumentalness
## Pop Top Correlation Values: 0.57 0.18 0.11
##
## Pop Top 3 for attributes: energy
## Pop Top Correlated Attributes: loudness valence tempo
## Pop Top Correlation Values: 0.67 0.44 0.19
##
## Pop Top 3 for attributes: key
## Pop Top Correlated Attributes: speechiness instrumentalness tempo
## Pop Top Correlation Values: 0.11 0.05 0.05
##
## Pop Top 3 for attributes: loudness
## Pop Top Correlated Attributes: energy valence danceability
## Pop Top Correlation Values: 0.67 0.29 0.1
##
## Pop Top 3 for attributes: mode
## Pop Top Correlated Attributes: acousticness duration_ms tempo
## Pop Top Correlation Values: 0.09 0.05 0.03
##
## Pop Top 3 for attributes: speechiness
## Pop Top Correlated Attributes: year tempo liveness
## Pop Top Correlation Values: 0.2 0.17 0.15
##
## Pop Top 3 for attributes: acousticness
## Pop Top Correlated Attributes: mode duration_ms popularity
## Pop Top Correlation Values: 0.09 0.07 0.04
##
## Pop Top 3 for attributes: instrumentalness
## Pop Top Correlated Attributes: energy danceability valence
## Pop Top Correlation Values: 0.13 0.11 0.1
##
## Pop Top 3 for attributes: liveness
## Pop Top Correlated Attributes: speechiness energy loudness
## Pop Top Correlation Values: 0.15 0.14 0.1
##
## Pop Top 3 for attributes: valence
## Pop Top Correlated Attributes: danceability energy loudness
## Pop Top Correlation Values: 0.57 0.44 0.29
##
## Pop Top 3 for attributes: tempo
## Pop Top Correlated Attributes: energy speechiness loudness
## Pop Top Correlation Values: 0.19 0.17 0.08
##
## Pop Top 3 for attributes: explicitness
## Pop Top Correlated Attributes: year speechiness danceability
## Pop Top Correlation Values: 0.15 0.13 0.1
Filtering out the other genres strengthened some of the correlations. This is expected, since grouping songs by genre is another way of grouping songs with similar attributes. However, because we are filtering out songs that are outliers to the group, we also lose some of the correlations seen in the full data set.
Loudness vs Energy - Improved to +0.67
Acousticness vs Energy - no correlation found between two attributes
Danceability vs Valence - Improved to +0.57
Explicitness vs Speechiness - Decreased to +0.13
The chart above indicates that song duration in the Spotify Top Hits has a mean of 3.812 minutes with a standard deviation of 0.652 minutes. If you are planning to compose a song in the near future, it is fair to say that a song more than 3 standard deviations from the mean (shorter than roughly 1.9 minutes or longer than roughly 5.8 minutes) is unlikely to be a hit. In practice, a song longer than about 5.5 minutes or shorter than about 2 minutes already falls in the sparsely populated tails of the distribution.
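The three-standard-deviation range can be worked out directly from the reported mean and standard deviation. The variable names below are introduced here for illustration.

```r
mean_min <- 3.812  # mean duration in minutes, as reported
sd_min   <- 0.652  # standard deviation in minutes, as reported

lower <- mean_min - 3 * sd_min  # 1.856 minutes
upper <- mean_min + 3 * sd_min  # 5.768 minutes

cat("3-SD range:", lower, "to", upper, "minutes\n")
```

If durations were exactly normal, about 99.7% of songs would fall inside this range; the real distribution need not be exactly normal, so that figure is only a guide.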
## Sample Size = 10 Mean = 3.817295 SD = 0.2028002
## Sample Size = 20 Mean = 3.815265 SD = 0.1518441
## Sample Size = 30 Mean = 3.806269 SD = 0.1153173
## Sample Size = 40 Mean = 3.810629 SD = 0.1044319
“The Central Limit Theorem states that the distribution of the sample means for a given sample size of the population has the shape of the normal distribution. The theorem is shown with various distributions of the input data in the following sections.”
It is clear that as the sample size increases, the distribution of the sample means looks more normal and its standard deviation (the standard error) gets closer and closer to 0. The results above follow this pattern: the standard deviation of the sample means decreases with each larger sample size.
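The reported values can be compared against the theoretical standard error, which is the population standard deviation divided by the square root of the sample size. The sketch below uses only the numbers printed in this report; the variable names are introduced here.

```r
population_sd <- 0.652                  # SD of song duration (minutes)
sample_sizes  <- c(10, 20, 30, 40)
observed_sd   <- c(0.2028002, 0.1518441, 0.1153173, 0.1044319)

# Theoretical standard error of the mean for each sample size
expected_se <- population_sd / sqrt(sample_sizes)

comparison <- data.frame(
  sample_size = sample_sizes,
  observed_sd = round(observed_sd, 3),
  expected_se = round(expected_se, 3)
)
print(comparison)
```

The observed values track the theoretical ones closely, which is what the Central Limit Theorem predicts.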
For sampling, we wanted to see whether different sampling methods give different results across the top 5 genres of the Spotify Top Hits.
Simple random sampling is when a sample of a specified size is selected from the larger group, or frame. In our case, each song has an equal chance of being selected, with a sample size of 100 across all samples. Out of the 1,348 songs in total, 100 are randomly selected without replacement.
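A minimal base R sketch of simple random sampling without replacement follows. The data frame `songs` is a hypothetical stand-in with 1,348 rows to match the frame size mentioned above; it does not contain the real song data.

```r
set.seed(123)  # fixed seed so the draw is reproducible

# Hypothetical frame: one row per song, identified only by an id
songs <- data.frame(id = 1:1348)

n <- 100
# Draw n distinct row numbers; every row has the same chance of selection
srs_rows <- sample(nrow(songs), n, replace = FALSE)
srs <- songs[srs_rows, , drop = FALSE]

nrow(srs)
```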
We also looked at systematic sampling, where a fixed rule decides which items are picked. Selection bias may occur with systematic sampling if there is a pattern in the input frame. For a sample size of 100, the data is divided into 27 groups, and every 27th item is taken. With systematic sampling, we may see some fluctuation in the data. In the pop, dance/electronic category, we may see an increase in the number of songs selected. We may also see an increase in the hip hop pop category.
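Systematic sampling can be sketched as picking a random starting row and then stepping through the frame at a fixed interval. The example below uses a small hypothetical frame of 20 rows and a sample size of 5; taking the interval as `ceiling(N / n)` is one common convention and is an assumption here, not necessarily the rule used in the original analysis.

```r
set.seed(7)

# Hypothetical frame of 20 items
frame <- data.frame(id = 1:20)

N <- nrow(frame)
n <- 5
k <- ceiling(N / n)          # sampling interval (4 for this toy frame)
start <- sample(k, 1)        # random start within the first interval

# Every k-th row from the start, dropping any index past the end of the frame
rows <- seq(start, by = k, length.out = n)
rows <- rows[rows <= N]

systematic <- frame[rows, , drop = FALSE]
systematic$id
```

Because the rows are evenly spaced, any periodic ordering in the frame (for example, sorting by genre or year) carries straight through to the sample.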
Lastly, we looked at stratified sampling, which occurs when the larger group of data is broken into smaller groups (strata) and a sample of a certain size is picked from each group. In this analysis, we looked at the top 5 genres with a sample size of 50.
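Stratified sampling can be sketched in base R by splitting the row numbers by genre and sampling within each group. The data frame `songs`, its three genres, and the proportional allocation of the sample are all assumptions made for illustration; the original analysis may have allocated its strata differently.

```r
set.seed(1)

# Hypothetical frame: 100 songs in three genres of unequal size
songs <- data.frame(
  id    = 1:100,
  genre = rep(c("pop", "hip hop", "rock"), times = c(60, 30, 10))
)

n <- 10
counts <- table(songs$genre)

# Proportional allocation: each stratum gets a share of n matching its size
strata_sizes <- setNames(as.integer(round(n * counts / nrow(songs))),
                         names(counts))

# Sample row numbers within each genre, then combine
rows_by_genre <- split(seq_len(nrow(songs)), songs$genre)
picked <- unlist(lapply(names(rows_by_genre), function(g) {
  rows <- rows_by_genre[[g]]
  rows[sample.int(length(rows), strata_sizes[[g]])]
}))

strat_sample <- songs[picked, ]
table(strat_sample$genre)
```

Unlike simple random sampling, this guarantees that small genres appear in the sample in proportion to their size.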
Throughout the analysis, it is important to understand that this data set is not indicative of all top hits from 1998-2020. This is a data set from Spotify using Spotify’s API, and it should not lead to conclusions outside of the application. It is clear, however, that popular songs typically fall in the pop category and that certain attributes are associated with that category. Certain genres like jazz, classical, and blues have trouble getting into the top hits, but that could be attributed to how popularity is calculated. Spotify listeners tend to prefer upbeat and uplifting songs, so it makes sense that these genres may be underrepresented. A thorough look through the song attributes by year shows the energy metric increasing significantly from 1998 to 2020. This is further supported by the danceability metric, which also increases steadily over the same period.